STATE OF MONTANA

NATURAL RESOURCE DAMAGE PROGRAM


RESPONSE  TO  THE  EXPERT  REPORT  OF  PAUL  SWITZER 


PREPARED  BY: 

DR.  RICHARD  BRAND 

DR.  LYMAN  McDONALD 


OCTOBER  20,  1995 


UNITED  STATES  DISTRICT  COURT 
DISTRICT  OF  MONTANA,  HELENA  DIVISION 


STATE  OF  MONTANA 

v. 

ATLANTIC  RICHFIELD  COMPANY 

No.  CV-83-317-HLN-PGH 


RESPONSE  TO  THE  EXPERT  REPORT  OF  PAUL  SWITZER 

DATED  JULY  1995 


Prepared  by: 


DR.  RICHARD  J.  BRAND 

AND 

DR.  LYMAN  L.  MCDONALD 

OCTOBER  20,  1995 


Testifying  Expert: 


Lyman L. McDonald, Ph.D.
WEST,  Inc. 
2003  Central  Avenue 
Cheyenne,  Wyoming  82001 



Testifying   Expert: 


Richard  J.  Brand 


Lyman  L.  McDonald 


PREFACE 


This  report  was  prepared  in  a  joint  effort  by  Dr.  Brand  and 
Dr.  McDonald.  Dr.  Brand's  opinions  are  contained  in  all  portions 
of  this  report  except  in  those  sections  or  paragraphs  within 
sections  which  are  designated  with  "[Dr.  McDonald]."  Dr. 
McDonald's  opinions  are  contained  in  all  portions  of  this  report 
except  in  those  sections  or  paragraphs  within  sections  which  are 
designated  with  "[Dr.  Brand]."  This  report  offers  expert  opinions 
in  response  to  the  opinions  expressed  in  Dr.  Paul  Switzer's  Expert 
Report (July 1995). Insofar as the statistical issues raised in
Dr.  Switzer's  report  are  also  raised  in  ARCO's  other  expert 
reports,  the  opinions  expressed  herein  are  also  applicable  thereto. 


TABLE OF CONTENTS


I. INTRODUCTION AND GENERAL OBSERVATIONS ABOUT THE
ARCO STATISTICAL REPORT

    A. The Speculative Nature of the ASR

    B. The Choice of Appropriate Statistical Technology

II. REFERENCE AREAS USED FOR COMPARISON WITH THE
INJURED AREAS

III. SAMPLE DATA USED WITHIN THE INJURED AND CONTROL AREAS

IV. MEASUREMENT ISSUES

    A. Historical and Recent Data

    B. Grab Samples vs. Four-Day Averages

    C. Fish Abundance and Injury Data

    D. Terrestrial Data

V. CONSISTENCY AND INTERPRETATION OF THE STATE'S
STATISTICAL METHODS AND RESULTS

    A. P-Values and Alpha Levels

    B. Confidence Intervals and Standard Errors

    C. Mann-Whitney and Ranked Squares Tests

    D. Multiple Comparisons

VI. ALTERNATE STATISTICAL METHODS

    A. Different Pooling-Weighting of Repeat Measurements

    B. Adjustment for Covariables

    C. Spatial Analysis


Digitized  by  the  Internet  Archive 

in  2011  with  funding  from 

Montana  State  Library 


http://www.archive.org/details/responsetoex199520bran 


I. INTRODUCTION AND GENERAL OBSERVATIONS ABOUT THE
ARCO STATISTICAL REPORT


The  report  by  Paul  Switzer,  dated  July  1995,  is  concerned  with 
various  statistical  issues  connected  with  the  determination  or 
quantification of injury to groundwater (G), aquatic resources (A)
or terrestrial (T) resources. Dr. Switzer's report, which we will
henceforth refer to as the ARCO statistical report (ASR), contains
18 summary points (S1-S18), 13 overall comments (O1-O13), 9
comments specific to groundwater (G1-G9), 12 comments related to
aquatic resources (A1-A12), and 8 comments specific to terrestrial
resources (T1-T8). A content analysis of the ASR reveals a
considerable  amount  of  redundancy.  Some  issues  are  raised  more 
than  once,  with  different  wording  or  a  slightly  different  slant, 
but  with  essentially  the  same  content. 

In  this  report  we  identify  and  respond  to  the  key  non-redundant 
issues  raised  by  the  ASR. 

A.    The  Speculative  Nature  of  the  ASR 

A  general  theme  in  the  ASR  is  based  on  the  following  logic:  First, 
bias  may  happen.  There  are  some  possible  ways  in  which  the  process 
used  to  investigate  the  differences  between  injured  areas  and  their 
corresponding  reference  areas  could  be  biased.  That  is,  the  process 
could  tend  to  give  answers  that  systematically,  on  the  average, 
differ  from  the  true  difference  due  to  mining  related  influences. 
The  logic  then  proceeds  with  two  unstated  assumptions:  (a)  Bias 
of  a  large  enough  size  to  alter  the  substance  of  the  conclusions 
has  actually  occurred;  and  (b)  the  direction  of  the  bias, 
presumed  to  have  occurred,  would  necessarily  exaggerate  (rather 
than  diminish)  the  observed  difference  between  the  injured  and 
reference  area.  (In  other  words,  it  would  exaggerate  the  State's 
claim.)  In  this  way  the  ASR  hopes  to  cast  doubt  on  the  relevance 
of  the  findings  of  the  State's  injury  investigations. 

A  related  theme  in  the  ASR  is  that  analysis  methods  used  by  the 
State  are  not  sufficiently  sophisticated.  Connected  with  this 
claim  is  the  strong  suggestion  that  use  of  alternate  methods,  which 
in  some  instances  are  specified  in  the  ASR,  would  substantially 
change  the  results.  Further,  the  language  and  tone  of  the  report 
are  designed  to  suggest  that  the  proposed  new  methods  of  analysis 
would  give  a  new  result  that  will  diminish  the  injury  claim. 

Despite  the  various  claims  in  the  ASR,  there  is  no  new  data 
reported  and  analyzed,  nor  is  there  any  new  analysis  of  old  data 
which  provides  any  tangible  evidence  that  any  bias  has  actually 
occurred,  or  that  if  it  has  occurred  that  it  has  in  fact 
exaggerated the State's claim. Instead, the critique provided by
the ASR is purely speculative.

B.    The  Choice  of  Appropriate  Statistical  Technology 

Another  general  issue  raised  by  the  ASR  concerns  the  choice  of  the 
appropriate  level  of  statistical  technology  needed  for  analysis  of 
new  data  collected  in  the  Natural  Resource  Damage  Assessment  (NRDA) 
or  for  the  analysis  of  data  assembled  from  historical  reports.  The 
philosophies adopted by the State and by the ASR differ substantially
on  this  point. 

The view inherent in the ASR is presented in summary point S18. It
says: "The suggested statistical methods described in the NRDA rules
are, by themselves insufficient for determination and quantification
of injury and the attribution and allocation of causes. More
sophisticated analyses are required than those which the State
performed to avoid statistical confounding."

We  have  a  different  view  of  the  NRDA  regulations.  We  see  the 
guidelines  for  a  type  B  injury  assessment,  which  the  State  has 
followed  as  thoroughly  as  possible,  as  a  practical  prescription  for 
a  trustee  of  public  resources  to  use  when  gathering,  analyzing,  and 
presenting  data  to  make  a  claim  for  injuries  to  the  resources  for 
which  the  trustee  has  responsibility.  Trustees  of  public  resources 
are  constrained  as  to  how  far  they  may  go  with  both  data  collection 
and  data  analysis.  There  are  repeated  reminders  in  the  NRDA 
regulations  that  expenditures  must  be  cost-effective.  Additional 
data  gathering  or  more  elaborate  data  analysis,  that  are  of 
academic  interest  but  are  not  likely  to  change  the  substance  of  the 
findings,  are  difficult  to  justify.  Adherence  to  the  type  B 
assessment  guidelines  confers  the  status  of  rebuttable  presumption. 
That  does  not  preclude  more  elaborate  data  collection  or  analysis 
by  ARCO,  but  the  speculative  criticisms  in  the  ASR  are  a  long  way 
from  the  production  of  tangible  new  evidence  about  the 
quantification of injury to the resources involved. The ASR is
designed to cast doubt, but sheds no new light.

The issue of appropriate statistical technology is not unique to an
NRDA.  In  the  academic  environment  the  exploration  of  alternate 
statistical  methodology  is  often  a  relevant  pursuit  out  of  pure 
intellectual  interest.  In  contrast,  applied  statisticians  must 
always  make  appropriate  technology  decisions  based  on  anticipated 
costs  and  benefits  of  different  approaches.  Every  applied 
statistician  makes  choices  which  involve  judgments  about  the 
likelihood  that  a  more  elaborate,  time  consuming  and  costly 
analysis  is  needed.  We  believe  that  the  authors  of  the  regulations 
were  concerned  about  such  issues  and  constructed  guidelines  that 
represented  a  sensible  balance.  The  regulations  provide  a 
guideline  for  appropriate  statistical  technology.  The  approaches 
used by the State for the determination and quantification of
injury were consistent with these guidelines, sensible in the
choice of appropriate statistical technology, and produced results
that  are  unlikely  to  be  altered  in  any  important  way  by  more 
elaborate  statistical  analysis. 

II.  REFERENCE  AREAS  USED  FOR  COMPARISON  WITH  THE  INJURED  AREAS 
(S1,4; O1,3,4a; G2; A8; T1a)

There is considerable redundancy in this category of ASR comments, which
boils  down  to  three  basic  criticisms:  (1)  Only  one  reference  area 
was  chosen.  The  ASR  recommends  that  a  collection  of  possible 
reference  areas  should  have  been  identified  from  which  one,  and 
preferably  more  than  one,  should  have  been  chosen  at  random.  Also, 
particular  attention  to  variability  between  different  reference 
areas  was  recommended  by  the  ASR.  (2)  The  State  had  some  advance 
knowledge  about  contamination  levels  or  other  relevant  variables 
for  reference  areas.  (3)  More  effort  should  have  been  made  to 
describe  "inevitable  differences"  between  the  injured  and  reference 
areas. 

The  State  has  made  a  concerted  effort  to  find  the  most  relevant 
reference  areas  based  on  similarity  between  the  injured  and 
reference areas with respect to geology, hydrology, geomorphology,
geochemistry,  habitat,  elevation,  etc.  and,  where  appropriate, 
proximity  to  the  injured  area.  The  availability  of  data  that  could 
be  used  to  quantify  injury  in  a  cost-effective  way  was  also  a 
consideration. 

For  groundwater  resources,  it  is  important  to  note  that  much  of  the 
historical data which were available for the reference areas were
collected  by  ARCO  in  connection  with  the  Superfund  remediation 
process.  It  is  not  too  surprising  that  many  of  these  same  areas 
were  also  judged  to  be  the  most  sensible,  appropriate  and  fair 
reference  areas  for  use  in  the  NRDA. 

A  considerable  amount  of  effort  was  required  to  search  through  the 
many  historical  reports  to  provide  reference  information  for 
comparison  with  injured  areas.  This  information  was  not  readily 
available  in  any  specific  way  when  the  reference  areas  for  stream 
reaches  and  terrestrial  injuries  were  chosen.  One  of  the  main 
concerns  in  the  selection  of  reference  areas  was  that  close 
proximity  carried  with  it  the  risk  of  using  reference  areas  which 
were  themselves  contaminated  by  the  mining  injuries  the  NRDA  was 
intended  to  quantify.  Some  nearby  streams  that  were  used  as 
reference  reaches  for  segments  of  the  Clark  Fork  River  (CFR)  and 
Silver  Bow  Creek  (SBC)  are  contaminated  by  mining  activity  not 
conducted  by  ARCO  or  its  predecessors.  The  State  took  this 
conservative  approach  in  order  to  use  reference  areas  that  were 
otherwise  similar.  In  the  study  of  the  effects  on  upland 
terrestrial  resources,  the  matching  reference  points  from  the 
German Gulch area were likely to have some effects from aerial
release of ARCO-derived hazardous material because German Gulch is
only about eight miles south. Again, the State took this
conservative approach in order to use points from a reference area
which is similar and in close proximity. In effect, the State
assumed that the ARCO-related mining injuries would be so profound
that  they  would  show  up  clearly  even  with  this  bias  in  favor  of 
ARCO. 

The  notion  that  the  State  should  have  developed  a  collection  of 
reference  areas  and  chosen  multiple  ones  from  them  at  random  is, 
for  several  reasons,  a  very  impractical  and  expensive  idea.  In 
order  to  have  several  candidate  reference  areas,  it  would  have  been 
necessary  to  include  areas  that  were  not  in  close  proximity  to  the 
injured  area,  and  it  would  have  been  necessary  to  use  reference 
areas  that  were  less  similar.  The  ASR  suggests  that  inter-area 
variability  of  candidate  reference  areas  should  have  been  studied. 
This  is  an  interesting  academic  idea  but  it  certainly  is  not  a 
cost-effective way to establish injured vs. reference comparisons.
In  order  to  get  reliable  estimates  of  the  actual  variability 
between  candidate  reference  areas  a  substantial  amount  of  data 
would  have  been  needed  from  each  of  the  reference  areas.  A 
substantial  allocation  of  resources  would  have  gone  to  simply 
investigating  reference  area  variability,  knowing  from  the  outset 
that  to  even  construct  a  set  of  candidate  areas  it  was  necessary  to 
compromise  on  the  proximity  and  similarity  with  respect  to  non- 
mining  factors. 

In  contrast  to  the  impractical  and  inefficient  suggestion  of  the 
ASR,  the  approach  used  by  the  State  was  to  the  point,  in  accordance 
with  the  regulations,  and  provided  a  fair  and  sensible  way  to 
establish  injured  vs.  reference  comparisons. 

The  ASR  requests  further  description  of  the  differences  between  the 
areas.  Where  this  was  feasible  and  necessary  for  the  analysis  such 
descriptions  were  provided.  The  most  challenging  issue  raised  by 
the  ASR  about  injured  vs.  reference  comparisons  was  the  concern 
about  the  balance  between  Butte  bedrock  injured  and  reference  wells 
with  respect  to  their  mineralization  status.  The  State  has 
investigated  this  further  and  was  aided  in  this  effort  by  some 
important  mineralization  maps  that  became  available  for  the  first 
time  when  ARCO  used  them  during  a  deposition  of  one  of  the  State's 
consultants.  This  will  be  addressed  further  in  the  State's 
rebuttal  to  ARCO's  Butte  groundwater  injury  reports.   [Brand] 

In  connection  with  this  issue,  the  ASR  followed  the  general  theme 
identified  in  the  introduction  of  this  report.  The  ASR  argues  that 
bias  may  happen,  that  it  may  be  substantial,  and,  through  tone  and 
language,  implies  that  there  is  indeed  a  substantial  bias  that 
favors  the  State's  injury  assessment.  In  reality  there  is  no 
evidence  that  there  is  any  bias.  There  is  inevitably  some 
uncertainty. If the assessment is in error due to chance variation
or some unknown systematic tendency in the investigation process,
it could just as easily favor ARCO as the State.

The  criticism  that  only  one  reference  area  was  selected  for  the 
terrestrial  uplands  study  indicates  an  incomplete  understanding  of 
the  State's  uplands  study  design.  There  was  no  "reference  area"  in 
the classical sense for the uplands study. Rather, points were
randomly  selected  from  the  impact  area(s)  and  matched  with  three 
points  from  the  German  Gulch  area  so  that  they  would  have  similar 
values  for  covariates  which  typically  influence  vegetation;  then, 
one  of  the  three  was  randomly  selected.  There  is  no  reference  area 
with  a  well  defined  boundary;  rather,  statistical  inferences  are  to 
the  randomly  sampled  impact  areas  and  the  written  protocol  by  which 
the  matching  was  accomplished. 
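
The matching step described above can be illustrated with a short sketch. It is not the State's written protocol: the covariate names, the scaled distance measure, the example values, and the function name are all hypothetical and serve only to show the two stages (shortlist three similar points, then pick one at random).

```python
import random

def match_reference_point(impact_point, candidates, rng, k=3):
    """Return (chosen, shortlist) for one impact point.

    Hypothetical rule: shortlist the k candidates closest to the impact
    point in covariate space, then pick one of the shortlist at random.
    """
    def distance(candidate):
        # Squared differences, each scaled by the impact value so that
        # covariates in different units contribute comparably (assumed).
        return sum(
            ((impact_point[key] - candidate[key]) / impact_point[key]) ** 2
            for key in impact_point
        )

    shortlist = sorted(candidates, key=distance)[:k]
    return rng.choice(shortlist), shortlist

# Illustrative values only; not data from the uplands study.
impact = {"elevation_m": 1800.0, "slope_pct": 12.0}
candidates = [
    {"elevation_m": 1790.0, "slope_pct": 11.0},
    {"elevation_m": 1810.0, "slope_pct": 13.0},
    {"elevation_m": 1805.0, "slope_pct": 12.5},
    {"elevation_m": 2400.0, "slope_pct": 30.0},
    {"elevation_m": 1200.0, "slope_pct": 2.0},
]

rng = random.Random(0)
chosen, shortlist = match_reference_point(impact, candidates, rng)
print("shortlist:", shortlist)
print("chosen:", chosen)
```

Because the final choice among the three similar points is random, the analyst cannot steer the comparison toward any particular reference point.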

III.  SAMPLE  DATA  USED  WITHIN  THE  INJURED  AND  CONTROL  AREAS 
(O5acd,8; G1,3)

Statistical  theory  is  based  on  the  idealization  that  a  random 
sampling  mechanism  is  used  in  the  selection  of  subjects,  wells, 
sites,  or  whatever  units  of  study  are  involved  in  an  investigation. 
In  reality,  the  use  of  fully  implemented  random  sampling  in 
scientific  investigation  is  extremely  rare.  Even  when  random 
sampling  is  attempted,  because  of  non-response  of  selected  subjects 
or  loss  to  follow-up,  it  is  seldom  possible  to  actually  complete  a 
truly  random  sample.  The  main  problem  is  that  perfectly 
implemented  random  sampling  is  often  simply  not  feasible.  Some 
alternatives  to  simple  random  sampling,  such  as  uniform  grid 
sampling  (with  a  random  starting  position) ,  which  was  feasible  to 
implement  in  the  uplands  terrestrial  study,  have  some  advantages 
over  simple  random  sampling  which  make  them  typically  more  precise 
and  preferable. 
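
As a rough illustration of uniform grid sampling with a random starting position, the following sketch lays a square grid over a rectangular study area. The dimensions, spacing, and function name are assumptions made for the example and do not describe the actual uplands design.

```python
import random

def grid_sample(width, height, spacing, rng):
    """Systematic grid over a width x height rectangle.

    A single random offset (x0, y0) is drawn once; every other point
    follows from it at fixed spacing.
    """
    x0 = rng.uniform(0, spacing)
    y0 = rng.uniform(0, spacing)
    points = []
    j = 0
    while y0 + j * spacing < height:
        i = 0
        while x0 + i * spacing < width:
            points.append((x0 + i * spacing, y0 + j * spacing))
            i += 1
        j += 1
    return points

# Hypothetical 1000 m x 500 m area sampled every 100 m.
points = grid_sample(1000.0, 500.0, 100.0, random.Random(1))
print(len(points), "grid points; first point:", points[0])
```

The only random element is the offset, yet every location in the area has the same chance of being sampled, and the points are spread evenly rather than left to cluster by chance.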

In the NRDA, random sampling was used whenever it was feasible and
preferable. The investigation of the groundwater resource is the
primary example where it was not reasonable to start fresh and
collect new data to ensure that the process used to choose well
sites was in fact truly random. This practical reality was
dictated by the large cost of drilling and sampling monitoring
wells, and the inherent budgetary limitations within which a
trustee must carry out an NRDA. Given the substantial body of data
available from other recent investigations, it simply was judged
inappropriate, given the cost-efficiency provisions in the
regulations, to spend a lot more money on fresh data. The
regulations  implicitly  recognize  these  important  practical  issues. 
A  judgment  about  the  possible  value  of  both  costly  data  collection 
and  more  elaborate  and  costly  data  analysis  must  be  made  in  all 
aspects  of  the  strategic  planning  of  an  NRDA  investigation.  In  the 
groundwater  investigation  in  particular,  it  is  our  judgment  that 
the  strategy  followed  was  the  most  sensible  one  given  all  of  the 
trade-offs  involved.   [Brand] 


By  looking  at  the  well  positions  on  a  map  one  cannot  tell  if  there 
is  any  bias  in  the  sampling  scheme  or  not.  It  is  well  known  that 
if you pick well sites uniformly at random in an area you will get
a pattern that typically appears to show clustering. This
phenomenon is illustrated in the spatial analysis book by Ripley in
Figure 3.1 (see attachment 1).1 The four panels in Figure 3.1 show
the  contrast  between  the  patterns  of  site  locations  that  typically 
arise  with  different  kinds  of  sampling.  Figure  3.1(a)  is  typical 
of  what  you  see  in  a  very  large  percentage  of  plots  resulting  from 
uniformly  randomly  placing  a  sample  of  points  on  the  plane.  It  is 
one  of  the  characteristics  of  human  perception  that  we  tend  to  see 
clusters  or  gaps  in  points  that  have  been  randomly  placed  on  a 
plane  even  when  there  is  no  clustering  mechanism  in  the  process 
that generated those points. Despite the typical appearance of
clustering, the placement of sites by uniform random location
guarantees a solid basis for computing p-values.
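
The apparent gaps and clusters in uniformly random points can be checked with a small simulation. This is a sketch under assumed settings (100 points on a unit square divided into a 10 by 10 grid of cells); it is not a reproduction of Ripley's figure. With 100 points and 100 equal cells, the chance that a given cell receives no point is (1 - 1/100)^100, so roughly a third of the cells are expected to be empty even though no clustering mechanism is at work.

```python
import random

def empty_cell_fraction(n_points, cells_per_side, rng):
    """Place n_points uniformly at random on the unit square and return
    the fraction of grid cells that receive no point."""
    occupied = set()
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        occupied.add((int(x * cells_per_side), int(y * cells_per_side)))
    total_cells = cells_per_side ** 2
    return 1 - len(occupied) / total_cells

rng = random.Random(1995)
fractions = [empty_cell_fraction(100, 10, rng) for _ in range(200)]
mean_empty = sum(fractions) / len(fractions)

# Probability that one particular cell is missed by all 100 points.
expected = (1 - 1 / 100) ** 100

print(f"simulated mean fraction of empty cells: {mean_empty:.3f}")
print(f"theoretical fraction of empty cells:    {expected:.3f}")
```

Empty cells sitting next to cells holding several points are what the eye reads as gaps and clusters.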

The  appearance  of  clustering  can  be  even  more  pronounced  when  the 
regions  of  interest  are  irregular  as  is  the  case  in  the  groundwater 
investigations  in  the  State's  NRDA.  Typically,  the  State  used  as 
the  baseline  area  the  available  wells  in  close  proximity  to  the 
injured  area  without  being  overly  concerned  with  the  shape  of  the 
outside  perimeter  of  the  implicit  control  area  represented  by  this 
approach.  This  was  a  sensible  approach  relative  to  the  goal  of 
balancing  the  injured  and  reference  areas  with  respect  to 
geological,  hydrological  and  geochemical  properties.  In  our  view 
it  was  a  reasonable  and  fair  strategy. 

The  spacing  of  wells  picked  at  random  is  very  different  from  the 
spacing  one  sees  if  wells  are  systematically  picked  at  the 
intersection points on a grid as shown in Figure 3.1(b). We agree
with  the  ASR  that  gridded  sampling  is  often  a  reasonable 
alternative  to  uniform  random  sampling  although  these  two 
approaches  do  not  have  exactly  the  same  theoretical  statistical 
properties.  In  fact,  gridded  sampling  was  an  aspect  of  the 
sampling  scheme  used  in  the  investigation  of  injury  to  the 
terrestrial  resources  near  the  Anaconda  smelter. 

The  well  sites  do  not  systematically  favor  the  State's  injury 
claims.  Generally  the  injured  well  sites  were  well  scattered  for 
monitoring  purposes  or,  in  the  case  of  Butte  bedrock,  were  based  on 
mine  locations  which  were  rather  well  dispersed  in  the  injured 
area.  Moreover,  in  the  case  of  the  mine  sites  in  Butte,  there  very 
likely  was  some  mixing  of  groundwater.  This  mixing  would  cause 
inherent  averaging  which  also  diminishes  the  importance  of 
particular  well  locations.  This  situation  is  quite  different  from 
the  case  of  test  locations  derived  from  mineral  prospecting,  where 
additional  test  sites  are  often  located  in  the  vicinity  of  earlier 
detected  "hot  spots."   [Brand] 


1 Spatial Statistics, by Brian Ripley, J. Wiley & Sons, 1981, pages 20 and 21.


The  other  argument  raised  by  the  ASR  with  respect  to  well  placement 
is  that  the  reference  data,  for  example  in  the  Anaconda  area,  came 
from  only  high  quality  wells.  The  State's  injury  analysis  was,  to 
the  best  of  our  knowledge,  based  on  all  of  the  wells  for  which  data 
were  available  in  close  proximity  to  the  injured  areas.  Moreover, 
the  variability  of  the  control  data  was  in  general  relatively 
small,  so  it  is  unlikely  that  any  alteration  of  the  weighting  of 
data,  which  might  result  from  a  spatial  analysis,  could  change  the 
central  value  of  the  reference  area  very  much  relative  to  the 
substantially  different  results  obtained  in  the  injured  area. 

Some  of  the  reference  wells  are  sufficiently  close  to  the  injured 
area  so  that  they  may  be  somewhat  contaminated  from  the  same 
pathways  that  led  to  elevations  of  mineral  contaminants  in  the 
injured  area.  In  general  the  State  chose  to  use  all  available  data 
in  close  proximity  to  the  injured  area  to  maximize  the  geological, 
hydrological, and geochemical similarities of the injured and
reference  areas.  In  so  doing,  there  was  a  real  risk  that  at  least 
some  of  the  baseline  data  would  be  contaminated  to  some  degree  by 
mining  related  influences.  This  would,  of  course,  bias  the  injury 
determination  in  favor  of  ARCO. 

It  is  also  claimed  in  the  ASR  that  some  wells  were  excluded  based 
on  water  quality  information  without  pathway  analysis.  It  is  not 
clear  to  what  this  vague  comment  is  referring.  To  our  knowledge, 
the  State  did  not  eliminate  wells  without  pathway  analysis.  We 
have  no  indication  that  any  meaningful  bias  from  this  source 
occurred.   [Brand] 

It  is  boldly  claimed  in  the  ASR  that  spatial  correlation  is 
"inevitable".  It  is  not  at  all  clear  that  there  is  any  meaningful 
amount  of  spatial  covariance  in  the  data.  Moreover,  ARCO  has  not 
provided  any  spatial  analysis  of  the  data  showing  that  substantial 
spatial  covariance  indeed  exists.  Thus,  ARCO  has  no  meaningful 
basis  for  disputing  the  reasonableness  of  the  findings  derived  from 
the  State's  method  of  analysis.  In  our  judgment  it  is  unlikely 
that  a  properly  done,  more  elaborate  analysis,  which  the  ASR  claims 
is  essential  on  theoretical  grounds,  will  make  much  difference  to 
the  substance  of  the  findings. 

IV. MEASUREMENT ISSUES (S5,14; O5b; A5,6a,7,9,11; T1b,3a,4,5a)

A  general  criticism  in  the  ASR  is  that  statistical  protocols  were 
sometimes  modified  after  data  were  collected.  It  is  almost  never 
possible  to  anticipate  all  important  aspects  of  an  experimental 
protocol  for  unique  field  and  laboratory  studies  in  advance. 
Modifications  of  the  State's  studies  are  documented.  Given  data 
collected  according  to  a  known  protocol,  it  is  also  difficult  to 
anticipate  all  appropriate  statistical  analyses  which  may  be  of 
interest in summarizing results of the study. It is good practice
to have an appropriate statistical protocol in place before data
collection, but it is poor practice to limit the analysis to
only  those  methods.  For  example,  if  it  were  required  that  the 
statistical  protocol  be  in  place  before  data  are  collected  then  no 
analysis  of  the  historic  groundwater  data  could  be  given. 
Certainly,  most  of  the  spatial  analysis  suggested  in  the  ASR  could 
not  be  conducted  because  that  method  was  not  part  of  a  statistical 
protocol  before  data  collection.  The  bottom  line  is  that  the  final 
experimental  and  statistical  protocols  must  be  appropriate  and 
completely  documented. 

A.    Historical  and  Recent  Data 

For  the  groundwater  injury  investigation,  nearly  all  of  the  data 
came  from  historical  measurements  that  were  made  outside  of  the 
scope  of  the  NRDA.  Many  of  these  measurements  were  made  by  ARCO  or 
their  consulting  firms.  Others  were  made  by  public  agencies  before 
the active period of the NRDA. Typically, good Quality Assurance/
Quality Control (QA/QC) procedures were followed in this work. [Brand]

For  more  recent  data  collection  under  the  auspices  of  the  NRDA, 
written  protocols  including  defined  QA/QC  procedures  were  followed. 
Whenever  feasible,  for  example  for  the  surface  water  laboratory 
work  and  soil  sample  work,  the  laboratory  samples  from  injured  and 
control  areas  were  interspersed  in  time  and  were  coded  in  such  a 
way  that  their  origin  could  not  be  deduced  by  laboratory  personnel. 
Typically,  laboratory  workers  were  not  directly  connected  with  the 
NRDA  and  had  little  vested  interest  in  the  measurement  outcomes 
other  than  their  general  professional  responsibility  to  do  quality 
laboratory  work.  ARCO  was  given  advance  notice  of  all  data 
collection  episodes  and  either  had  representatives  present  or  chose 
to  forego  that  option.  ARCO  also  had  split  samples  which  were 
analyzed  and  did  not  challenge  the  State's  results.  In  summary, 
there  was  little  potential  for  workers  connected  with  the  NRDA  to 
introduce  any  bias  in  these  measurements. 

The  bulk  of  the  historical  surface  water  measurements  were  made  by 
public  agencies  that  were  doing  routine  monitoring  of  rivers  and 
streams.  The  measurement  procedures  used  typically  also  included 
appropriate safeguards for quality assurance and quality
control. [Brand]

In  general,  new  measurements  that  were  made  for  assessing  injury  to 
aquatic  (surface  water,  sediments,  fish)  and  terrestrial  resources 
in  connection  with  the  NRDA  were  done  with  strict  attention  to 
QA/QC  procedures.  Many  of  the  analyses  were  made  in  third  party 
laboratories  that  typically  provide  services  to  many  clients. 
There  is  no  reason  to  believe  there  was  any  opportunity  or  any 
conscious  or  unconscious  intent  or  effort  by  NRDA  staff  or 
laboratory  personnel  to  generate  biased  measurements  which  favored 
the State's injury claim. Moreover, no evidence to this effect has
been presented in the ASR. This is another example of speculative
criticism that has no foundation in fact.

The  State's  decision  to  use  only  trout  data  was  a  decision  based  on 
the  fact  that  trout  are  the  premier  game  fish  in  Montana; 
therefore,  these  are  the  most  important  species  for  purposes  of 
injury and damage quantification.

B.    Grab  Samples  vs.  Four-Day  Averages   [Brand] 

The  ASR  expressed  the  opinion  that  a  "grab"  sample  of  surface  water 
or  an  average  of  several  grab  samples  over  a  four-day  period  would 
not be as statistically stable as a four-day average. We agree
that  anything  short  of  a  continuous  four-day  average  would  have  an 
additional  component  of  measurement  variability.  Thus  a  grab 
sample  or  average  of  grab  samples  over  a  four-day  period  could  be 
either  higher  or  lower  than  the  continuous  time  average  during  the 
four-day  period. 

Also,  the  ASR  says  that  the  true  four-day  average  exceedances  will 
occur  less  often  than  exceedances  calculated  for  grab  samples  or 
even  the  average  of  several  grab  samples  over  a  four-day  period. 
This  statement  is  not  generally  true.  If  the  four-day  time  average 
exceeds  the  criterion  level,  then  measurement  fluctuations  involved 
in  grab  samples  will  only  cause  an  error  in  the  detection  of  an 
exceedance  when  the  grab  data  is  both  too  low  and  actually  below 
criterion.  In  this  case  the  use  of  grab  data  as  a  surrogate  for 
the  four-day  average  will  underestimate  the  number  of  exceedances. 
If  the  four-day  time  average  is  below  the  criterion  then  the 
reverse  is  true  and  the  grab  data  will  tend  to  overestimate  the 
number  of  exceedances.  It  is  not  entirely  clear  which  of  these 
cases  would  predominate  for  the  CFR.  However,  given  other  evidence 
about  the  way  contaminant  levels  tend  to  increase  during  the  spring 
rainy  season,  it  is  likely  that  grab  samples  may  well  underestimate 
the  number  of  exceedances  that  would  be  found  from  a  strict  four- 
day  time  average  approach. 
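
The direction of the error described above can be illustrated with a short calculation. This is a minimal sketch and not part of the State's analysis. It assumes, purely for illustration, that a single grab sample equals the true four-day average plus normally distributed fluctuation. The concentrations, criterion, and standard deviation are hypothetical values, and the function name is our own.

```python
from statistics import NormalDist

def grab_exceedance_probability(true_average, criterion, noise_sd):
    """Chance that one noisy grab sample lands above the criterion.

    Assumption (illustrative only): grab = true four-day average + normal noise.
    """
    return 1.0 - NormalDist(mu=true_average, sigma=noise_sd).cdf(criterion)

# Hypothetical case 1: the true four-day average exceeds the criterion.
# Every such period is a true exceedance, but a grab sample flags only a fraction.
prob_when_above = grab_exceedance_probability(120.0, criterion=100.0, noise_sd=25.0)

# Hypothetical case 2: the true four-day average is below the criterion.
# No such period is a true exceedance, but a grab sample still flags some.
prob_when_below = grab_exceedance_probability(80.0, criterion=100.0, noise_sd=25.0)

print(f"flagged when truly above: {prob_when_above:.3f} (true rate is 1)")
print(f"flagged when truly below: {prob_when_below:.3f} (true rate is 0)")
```

Under these assumptions, grab samples undercount exceedances in the first case and overcount them in the second, so the net effect depends on which case predominates.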

The  ASR  then  states  that  a  statistical  method  is  available  to 
adjust  for  different  ways  to  operationally  define  exceedance  and  it 
should  have  been  used  by  the  State.  For  the  historical  Clark  Fork 
assessments,  this  is  easy  to  say  but  not  so  easy  to  implement  in  a 
satisfactory  way.  It  is  clear  that  the  ASR's  comments  only  pertain 
to  the  EPA  definition  of  chronic  injury  to  surface  water.  They 
ignore  acute  injury.  Moreover,  the  ASR  comments  do  not  strictly 
pertain  to  the  applicable  water  quality  standards  for  chronic 
injury  for  the  surface  waters  involved  in  this  NRDA.  The 
applicable  Montana  standards  define  chronic  injury  in  terms  of  a 
four-day or longer averaging period. This difference of definition
is  relevant  to  the  State's  procedure  for  measuring  chronic 
exceedance  which  is  discussed  below. 

There  is  no  dispute  about  the  definition  of  acute  injury  because  it 
is  simple  and  practical  to  implement.   The  EPA  standard  defines 


acute  injury  in  terms  of  a  one-hour  average.  EPA  research  found 
that  the  frequency  of  exceedances  based  on  a  single  grab  sample  and 
one-hour average were virtually the same.2 The Delos study did not
use local Montana data, so there is some question about the
applicability  of  his  finding  to  the  Clark  Fork  River.  However,  the 
Montana  standards  applicable  to  this  NRDA  do  not  make  a  distinction 
between  a  one-hour  average  and  a  grab  sample.  They  simply  express 
the  acute  criterion  in  terms  of  a  single  grab  sample.  Using  the 
Montana  acute  criterion,  even  with  relatively  infrequent  sampling, 
the  historical  data  for  the  CFR  and  the  SBC  show  consistent 
exceedances of the applicable criteria. These are well
documented  in  Montana's  aquatics  report. 

The  historical  data,  which  are  available  for  examining  exceedance 
according to the chronic criterion, do not provide true four-day
averages.  It  simply  was  not  practical  for  agencies  that  were 
monitoring  surface  water  to  obtain  true  four-day  averages,  except 
in  specialized  research  projects  over  a  few  relatively  short  time 
periods.  In  general  true  four-day  averages  are  seldom  used.  Also, 
the  way  the  EPA  criterion  is  written,  even  an  occasional  true  four- 
day  average  would  not  be  enough.  One  would  really  need  constant 
monitoring  and  then  use  a  four-day  moving  time  window  to  determine 
how  many  four-day  periods  there  are  when  the  criterion  contaminant 
level  is  exceeded.  If  there  is  more  than  one  four-day  exceedance 
in  three  years  then  the  surface  water  has  been  injured  according  to 
applicable  chronic  criteria.  Clearly,  the  number  of  exceedances 
you  find  with  occasional  true  four-day  average  samples  is  likely  to 
be  substantially  less  than  you  would  find  with  the  continuous  but 
impractical  monitoring  that  would  be  compatible  with  the  EPA 
chronic  criterion. 
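
The moving time window described above can be sketched as follows. This is a minimal illustration, not a procedure taken from the EPA criterion or from the State's reports. It assumes that daily mean concentrations stand in for continuous monitoring and that every overlapping four-day window is counted separately, which is only one possible reading. The data values, criterion, and function name are hypothetical.

```python
def count_four_day_exceedances(daily_means, criterion, window=4):
    """Count windows of `window` consecutive days whose average exceeds the criterion.

    Assumption (illustrative only): overlapping windows are each counted once.
    """
    count = 0
    for start in range(len(daily_means) - window + 1):
        window_mean = sum(daily_means[start:start + window]) / window
        if window_mean > criterion:
            count += 1
    return count

# Hypothetical ten-day record with a four-day pulse of elevated concentrations.
daily_means = [5, 5, 5, 5, 20, 20, 20, 20, 5, 5]
n_exceedances = count_four_day_exceedances(daily_means, criterion=10)
print(n_exceedances)
```

A monitoring program that records only an occasional four-day average would see at most a few of these windows, which is why it would tend to find fewer exceedances than continuous monitoring.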

The  amount  of  work  required  to  determine  the  relationship  between 
grab  samples  and  four-day  average  exceedances  would  be  substantial. 
It  would  take  much  more  effort  than  the  glib  comments  in  the  ASR 
would  suggest.  It  is  necessary  to  have  properly  spaced  time  series 
data  at  crucial  times  of  the  year  to  empirically  determine  the 
temporal  patterns  of  metal  contaminants  in  local  waters.  It  is  not 
at  all  clear  that  appropriate  historical  data  for  local  waters  are 
available.  Data  from  outside  of  the  local  area  would  be  suspect 
since  the  pathways  involved  in  the  contamination  would  probably  not 
be  the  same.  Moreover,  there  are  many  temporal  autocorrelation 
models  that  could  be  applied.  A  careful  model  selection  procedure 
would  be  required  since,  for  this  particular  issue,  the  choice  of 
autocorrelation  model  will  be  crucial  and  controversial. 

The  State  has  used  a  sensible  and  much  more  cost-effective  approach 
to  operationally  define  chronic  exceedance.  It  is  based  on  the 
applicable  Montana  surface  water  quality  standards  for  chronic 


2  "Metal  Criteria  Excursions  in  the  Unspoiled  Watersheds,"  Draft  Report,  Charles  Delos,  Office  of  Water,  U.S.  EPA,  Wash.  DC,  Sept.  14, 
1990. 



criteria  which  provide  for  use  of  a  four-day  or  longer  time 
average.  The  State  has  focused  on  a  single  three-month  time  window 
during  the  spring  runoff  in  each  year.  The  average  of  grab  samples 
obtained during this three-month window provides an estimate of the
three-month  time  average  during  this  period.  This  gives  a 
reasonable  and  practical  way  to  implement  the  applicable  Montana 
chronic  criterion.  It  gives  a  good  indication  that  substantial 
chronic  contamination  has  been  occurring  in  the  surface  waters 
involved  in  this  NRDA. 

C.    Fish  Abundance  and  Injury  Data   [McDonald] 

The State used a simple, straightforward statistical analysis (the
sign  test)  to  show  that  statistically  significant  differences  exist 
in  fish  abundance  between  the  test  area  of  the  Clark  Fork  River  and 
Silver  Bow  Creek  and  the  matching  reference  reaches.  The  ASR 
states  that  "Incorrectly  treating  the  test  area  as  a  homogeneous 
unit  could  have  important  consequences  for  the  damage  assessment 
because  of  differences  that  exist  within  the  test  area."  This 
criticism  is  unfounded  because  the  design  of  the  study  resulted  in 
study  sites  in  Silver  Bow  and  Clark  Fork  that  were  different  from 
each  other  with  respect  to  geologic,  geomorphic  and  hydrologic 
attributes.  Stream  reaches  on  Silver  Bow  and  the  Clark  Fork  were 
matched  with  stream  reaches  from  the  reference  streams  before 
random  samples  were  selected  from  each. 

The  State's  study  also  measured  fish  abundance  during  the  same 
season  of  the  year  for  two  years  using  the  same  methods  to  account 
for  the  possibility  that  there  may  be  substantial  seasonal  or  time 
trends in fish population size. The ASR includes the statement that
"...the  evident  long-term  temporal  complexity  of  fish  density  data 
is  not  accounted  for  in  the  State's  statistical  comparisons  of 
reference  and  impact  sites."  This  is  another  example  of 
speculative criticism that calls for unrealistically expensive studies.
The  fact  remains  that  differences  in  fish  abundance  were  found  for 
the  season  and  years  of  study  for  almost  all  parameters  (i.e., 
biomass,  abundance)  for  all  matched  pairs.  It  is  very  unlikely 
that  this  is  the  result  of  chance  alone. 
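
The logic of the sign test can be sketched in a few lines. This is a minimal illustration with hypothetical counts, not the State's actual calculation. It assumes a one-sided test in which, if chance alone were operating, each matched pair would be equally likely to show the lower value in the test reach or in the reference reach. The function name is our own.

```python
from math import comb

def sign_test_p_value(n_pairs, n_lower_in_test_area):
    """One-sided sign test: P(X >= k) when X ~ Binomial(n_pairs, 1/2)."""
    favorable = sum(comb(n_pairs, k)
                    for k in range(n_lower_in_test_area, n_pairs + 1))
    return favorable / 2 ** n_pairs

# Hypothetical example: 10 matched pairs, test reach lower in all 10.
p_all_ten = sign_test_p_value(10, 10)   # 1/1024
# Hypothetical example: test reach lower in 9 of 10 pairs.
p_nine_of_ten = sign_test_p_value(10, 9)  # 11/1024

print(f"{p_all_ten:.5f} {p_nine_of_ten:.5f}")
```

Even with a modest number of pairs, a nearly unanimous direction of difference produces a small p-value under these assumptions.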

We  grant  that  study  of  dose-response  relationships  in  the  field  may 
be  useful.  Such  a  relationship  was  in  fact  studied  for  the 
evaluation  of  phytotoxicity  of  soils  in  the  uplands  terrestrial 
study  (the  results  of  which  were  criticized  in  another  section  of 
the ASR). Such a study is more problematic in the evaluation of
fish  abundance  because  of  the  complex  interactions  of  metals  in  the 
water,  metals  in  the  sediment  and  food,  effects  of  differences  in 
habitat  quantity  or  quality,  etc.  Such  studies  are  useful  only 
when  other  confounding  variables,  such  as  those  listed  above,  can 
be  controlled  for.  Other  study  procedures  were  used  in  the  State's 
assessment  of  injury  to  fish  including  estimation  of  differences  in 
fish  density  between  test  and  reference  sites,  laboratory  feeding 
studies,   laboratory  avoidance  studies,   laboratory  growth  and 



survival  studies,  etc.  The  implication  that  field  dose-response 
data might have led to different conclusions concerning fish injury
is another example of the ASR using tone and presentation to cast
doubt on the relevance of the findings from the State's injury
investigations.

D.    Terrestrial  Data   [McDonald] 

The  ASR  claims  that  there  is  biased  sampling  of  riparian  soils. 
The  State's  assessment  has  been  misunderstood.  It  was  necessary  to 
sample  only  the  slickens  for  evaluation  of  injury,  because  injury 
of  riparian  soils  is  being  claimed  only  for  the  areas  covered  by 
slickens. 

A general criticism in the ASR is that data from impact and/or control
areas  have  been  pooled  inappropriately  and  the  analysis  is  not 
matched  to  the  paired  design  of  the  State's  study.  This  criticism 
seems  to  be  directed  at  the  phytotoxicity  study  where  soil  was 
collected  from  control  points  in  German  Gulch  and  transported  to 
the  laboratory  for  study  of  differences  in  plant  growth.  As 
explained  elsewhere  in  this  report,  points  were  located  on  a  grid 
over  the  uplands  impact  area  and  matched  with  points  in  German 
Gulch  for  study  of  differences  in  wildlife  habitat  and  vegetation 
present.  The  points  were  matched  on  the  basis  of  elevation, 
aspect,  slope,  etc.;  covariates  which  potentially  affect  vegetation 
in  the  field.  Once  the  soil  is  removed  from  impact  and  control 
points  and  taken  to  the  laboratory,  the  value  of  matching  is  gone, 
because  the  covariates  of  elevation,  slope,  aspect,  etc.  are  not 
present  in  the  laboratory.  For  this  reason  and  to  control  costs, 
parameters  of  plant  growth  in  the  soil  from  an  impact  site  were 
compared  to  growth  parameters  from  plants  in  composited  control 
soil (or to means of replicated trials in control soil). In other
words,  the  decision  to  pool  was  based  on  non-statistical  issues. 

It  is  interesting  to  note  that  the  ASR  criticized  the  use  of 
composited  control  soil  (or  pooling  of  data  from  control  soils)  in 
the  State's  plant  growth  studies,  but  ARCO  consultants,  Dr.  Redente 
and Dr. Keammerer, actually carried the pooling of soil samples to the
extreme. We quote from Dr. Redente's report dated July, 1995, "At the
time  of  the  study  all  of  the  0-2  inch  soil  samples  from  German 
Gulch  were  combined  into  one  sample.  In  addition,  the  2-8  inch 
soil  samples  from  German  Gulch  were  combined  into  a  second  sample. 
The  same  procedure  was  followed  for  Stucky  Ridge  and  Mount  Haggin 
soils."  The  result  of  this  pooling  in  ARCO's  studies  is  that  no 
measure  of  spatial  variation  exists  in  ARCO's  data.  All  variation 
in  plant  growth  is  strictly  due  to  laboratory  technique.  The 
State's  study  maintained  separate  samples  from  the  impact  areas  and 
limited  pooling  of  spatial  samples  of  control  soils.  Statistical 
significance  in  the  State's  study  was  judged  on  the  basis  of 
spatial  variation  combined  with  laboratory  variation;  this  was  a 
much  more  conservative  approach  to  testing,  which  included  non- 
standard statistical  treatment  of  control  samples  to  account  for 



the  limited  compositing  conducted  by  the  State. 

V.  CONSISTENCY  AND  INTERPRETATION  OF  THE  STATE'S  STATISTICAL 
METHODS AND RESULTS (S6,8,9,12,17; O7,9,10,11,12,13;
G5b,8,9a; A1,12a; T3b,7)

A.    P-Values  and  Alpha  Levels 

A  number  of  comments  in  the  ASR  pertain  to  the  way  p-values  or 
related  alpha  levels  were  handled  in  reports  prepared  by  the  State. 

The  alpha  level  of  a  statistical  test  stems  from  a  subjective 
choice.  The  alpha  level  measures  one  aspect  of  the  performance  of 
a  statistical  test  when  it  is  being  used  to  make  a  judgment  about 
the  meaning  of  an  observed  difference,  say  between  an  injured  and 
a  reference  area.  The  choice  of  an  alpha  level  provides  a 
criterion  for  judging  the  "statistical  significance"  of  an  observed 
difference.  It  is  the  subjective  value  of  probability  allowed  by 
a  researcher  that  the  observed  difference  can  occur  by  chance  alone 
when  there  is  really  no  true  difference  between  the  injured  and 
reference  areas.  The  play  of  chance  comes  from  random  aspects  of 
sampling  and  measuring  processes  used  to  get  sample  data  from  the 
comparison  areas.  Alpha  levels  of  0.05  (5%)  or  0.10  (10%)  are  most 
often  used. 

Since  the  alpha  level  must  ultimately  be  chosen  by  the  person  who 
will  interpret  the  observed  statistical  difference,  it  is  important 
that  the  conclusions  to  be  drawn  from  results  of  a  statistical 
analysis  be  flexible  enough  to  accommodate  any  chosen  alpha  level. 
The  desired  flexibility  is  provided  by  the  p-value,  which  is 
defined  as  the  smallest  alpha  level  at  which  the  observed  result 
would  be  statistically  significant  (i.e.,  considered  more  than  a 
chance finding). Any user of the statistical results, when
interpreting  a  statistical  difference,  can  compare  their  own  alpha 
level  with  the  reported  p-value.  If  the  p-value  is  less  than  their 
personal  alpha  level  then  the  conclusion  which  must  be  drawn  is 
that  the  observed  difference  is  "statistically  significant." 

Since  the  alpha  level  is  a  value  to  be  determined  on  the  basis  of 
professional  judgment,  there  are  typically  different  opinions  about 
what  level  to  use.  Not  surprisingly,  different  views  of  the 
preferred  alpha  level  can  be  found  between  State  and  ARCO 
consultants  and  even  between  the  various  investigators  within  each 
group. 

By  contrast  a  p-value  gives  a  direct  measure  obtained  from  the 
available  data,  which  is  not  subject  to  the  same  inherent 
individual  subjectivity.  The  p-value  provides  relevant  information 
about  whether  chance  alone  is  a  plausible  explanation  for  the 
observed  differences  between  injured  and  reference  areas. 



The  ASR  has  made  a  considerable  number  of  comments,  which  we  will 
now  turn  to,  about  how  p-values  were  calculated,  what  they  actually 
indicate,  and  what  their  limitations  are. 

First,  the  ASR  noted  that  different  statistical  testing  procedures 
for  computing  p-values  were  used  in  different  parts  of  the  State's 
statistical  work.  Moreover,  the  ASR  attributes  some  sinister 
intent  to  the  use  of  different  tests  in  different  contexts.  In 
reality,  there  are  hundreds,  if  not  thousands,  of  different 
statistical  tests  that  have  been  developed  in  the  cumulative 
history  of  statistical  theory.  Depending  on  the  statistical 
structure  of  a  problem,  different  tests  are  required.  Some 
differences  across  reports  or  even  between  parts  of  reports  stem 
simply  from  the  fact  that  different  questions  inherently  involve 
different  statistical  setups  that  demand  different  statistical 
testing  procedures. 

Even  for  a  single  question  with  a  specific  statistical  structure 
there  is  usually  more  than  one  testing  option  which  is  appropriate 
for drawing inductive inferences. Despite the great deal of
effort that theoretical statisticians have put into
exploring  alternative  approaches  for  statistical  testing, 
experienced  practicing  statisticians  find  that  the  p-values 
constructed  from  appropriate  alternative  approaches  to  the  same 
data  typically  do  not  change  the  substance  of  the  conclusions. 

Also,  a  practicing  statistician  typically  has  more  experience  with 
some  procedures  than  others,  and  investigators  in  substantive  areas 
who  are  only  modestly  trained  in  statistical  methods  typically  have 
working  familiarity  with  even  fewer  options  and  tend  to  gravitate 
to  the  methods  with  which  they  are  most  familiar.  Some  testing 
approaches,  such  as  randomization  tests,  require  more  specialized 
computer software and knowledge, while other, more
familiar methods, such as the t-test or the Mann-Whitney test, are
typically more generally understood and more easily implemented.
The  regulations  themselves  prescribe  specific  tests  for  some 
resources  but  not  for  others.  For  example,  for  groundwater 
comparisons,  the  specific  prescribed  tests  are  the  Mann-Whitney 
test,  which  is  easy  to  implement  and  the  ranked  squares  test,  which 
takes  deeper  knowledge  and  some  custom  programming.  The  variations 
in  methods  for  statistical  testing  that  appear  in  the  State's 
reports  stem  from  these  factors  rather  than  any  attempt  to  somehow 
stack  the  deck  in  the  State's  favor. 

The  ASR  correctly  points  out  that  the  p-value  does  not  by  itself 
measure the "importance" of an observed difference between an injured
and  reference  area.  The  State  has  not  tried  to  use  p-values  to 
measure  "importance"  and  it  is  not  clear  to  us  why  the  ASR  even 
makes  this  claim.  Descriptive,  numerical  or  graphical 
presentations,  which  show  the  size  of  the  observed  differences 
between  injured  and  reference  areas,  have  been  consistently 
presented  in  the  State's  reports.   The  p-values  guide  judgments 



about  whether  chance  alone  can  be  reasonably  eliminated  as  the 
basis  for  observed  differences.  The  descriptive  results,  along 
with  documentation  of  the  criterion  levels,  such  as  MCL's  or 
SMCL's,  provide  the  basis  for  assessing  the  "importance"  of  the 
observed  difference  between  the  injured  and  reference  comparison 
areas. 

B.    Confidence  Intervals  and  Standard  Errors 

The  ASR  argues  that  the  State's  use  of  means,  medians,  graphs, 
etc.,  along  with  p-values  is  not  sufficient.  Instead  it  indicates 
a  preference  for  the  use  of  a  confidence  interval  (CI)  or  a 
standard  error  (SE)  to  indicate  the  precision  of  the  observed 
descriptive  measures  such  as  means,  medians,  differences  between 
medians  or  means,  etc.  We  do  not  agree  that  the  CI  or  SE  approach 
for  assessing  the  impact  of  chance  variation  on  the  estimates  is  a 
necessary  or  a  preferred  way  to  augment  descriptive  statistical 
results  in  an  NRDA.  This  viewpoint  derives  largely  from  the 
specific  nature  of  the  decision  process  that  is  involved  in  an 
NRDA.  However,  as  indicated  below,  there  are  some  specific 
situations in an NRDA for which the CI or SE approach is quite
relevant.  Following  a  brief  reminder  of  the  nature  of  CI's  and  SE's 
and  the  information  they  provide,  we  give  the  reason  for  our  views. 

A  fundamental  statistical  procedure  is  to  use  sample  data  to 
construct  an  estimate  of  a  distribution  property  such  as  a  median, 
mean,  or  perhaps  the  difference  between  two  medians  or  means.  Such 
estimates,  which  are  often  called  point  estimates  because  they  boil 
the  data  down  to  a  single  number,  are  carefully  designed  to  give 
the  best  single  indication  of  what  is  true  for  the  underlying 
domain(s) that are being investigated by the sample data. Point
estimation  is  a  form  of  statistical  inference  since  point  estimates 
are  used  to  make  an  inference  about  the  underlying  hidden  entity, 
say  the  injured  area  or  reference  area  or  both,  based  on  the 
observed  data  from  the  sample.  Since  a  sample  gives  only  a  partial 
view  of  the  underlying  domain,  and  sample  data  can  vary  by  the  play 
of  chance  in  the  choice  and  measurement  of  test  samples,  one  must 
recognize  that  the  observed  estimates  may  be  too  high  or  too  low. 
The  point  estimate  is  designed  to  do  the  best  possible  job  of 
balancing  between  getting  an  answer  that  is  too  high  and  one  that 
is  too  low. 

A  "confidence  interval"  (CI)  augments  the  point  estimate.  In 
addition  to  the  point  estimate,  an  interval  is  provided.  The 
interval  contains  the  point  estimate  and  it  is  constructed  so  it 
will  contain  the  true  value  that  the  estimate  is  trying  to 
determine.  The  CI  does  this  with  a  prescribed  probability  called 
the  "confidence  level."  The  choice  of  a  confidence  level,  like  the 
alpha  level  to  which  it  is  related,  is  a  professional  judgment. 
Confidence levels are typically set at 90%, 95%, or 99%.

In  many  statistical  problems,  the  CI  is  an  interval  with  the  point 



estimate  at  its  center.  For  a  95%  CI  in  such  cases,  the  distance 
from  the  center  to  the  upper  or  lower  endpoint  of  the  CI  is 
typically about two times the related statistical quantity called
the "standard error" (SE). Thus the CI and the SE are conceptually
linked  and  typically  contain  the  same  basic  information  about 
precision. 

In  many  statistical  approaches  for  studying  differences,  there  is 
a  close  relationship  between  a  statistical  test  and  a  CI.  When  the 
confidence  level  (C)  and  alpha  are  expressed  as  percentages,  an 
observed  difference  would  be  declared  unlikely  to  be  due  to  chance 
alone at the chosen alpha level (i.e., statistically significant) if
and only if the confidence interval with confidence level C% =
100% - alpha% does not contain zero.
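
The link between a significance test and a CI can be checked numerically. This is a minimal sketch with hypothetical numbers. It assumes a two-sided test and a symmetric interval, both based on a normal approximation. The estimate, standard error, and function names are illustrative and are not drawn from the State's data.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def normal_ci(estimate, standard_error, confidence):
    """Symmetric CI: estimate +/- z * SE, with z taken from the normal distribution."""
    z = STANDARD_NORMAL.inv_cdf(0.5 + confidence / 2.0)
    return estimate - z * standard_error, estimate + z * standard_error

def two_sided_p_value(estimate, standard_error):
    """P-value for the null hypothesis that the true difference is zero."""
    z = abs(estimate / standard_error)
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(z))

def excludes_zero(interval):
    lower, upper = interval
    return lower > 0.0 or upper < 0.0

# Hypothetical observed difference and standard error.
estimate, standard_error = 10.0, 4.0
p_value = two_sided_p_value(estimate, standard_error)
ci_95 = normal_ci(estimate, standard_error, 0.95)
ci_99 = normal_ci(estimate, standard_error, 0.99)

# Significant at alpha = 5% exactly when the 95% CI excludes zero, and
# significant at alpha = 1% exactly when the 99% CI excludes zero.
print(p_value < 0.05, excludes_zero(ci_95))
print(p_value < 0.01, excludes_zero(ci_99))
```

The 95% interval here has a half-width of about two standard errors, which is the rule of thumb stated earlier in this section.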

The  p-value  is  often  rather  informative  from  a  CI  point  of  view. 
When  studying  a  difference,  there  is  typically  a  simple  connection 
between  a  p-value  and  a  related  CI.  Using  the  zinc  results  from 
the  Butte  bedrock  data  (injured  vs.  reference  areas)  as  an  example, 
the  observed  difference  in  medians  is  16,200  -  80.5  =  16,119.5  ppb 
with p=0.003 (0.3%). To a reasonable approximation, this
corresponds to saying that the confidence interval for the true
difference between medians, with confidence (C) = (100%-0.3%) =
99.7%, would go from about 0 ppb to 32,239 ppb. This is 0 to twice
the observed difference in medians. If the p-value for this result
had been 0.01 (1%) then the confidence level for the same
confidence interval from 0 to 32,239 ppb would have been 99%. In
general  the  p-value  will  determine  the  confidence  level  for  the 
confidence  interval  which  goes  from  0  to  twice  the  observed 
difference. 

A  90%  or  a  95%  CI  for  the  true  difference  of  medians  cannot  be  so 
readily  determined  from  the  p-value  itself.  A  slightly  more 
complicated calculation would be required. These intervals would
also  be  centered  on  16,119.5  but  would  not  extend  as  far  up  and 
down  from  this  central  value. 
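
One way the slightly more complicated calculation might be carried out is sketched below. This is an illustration under a stated assumption, not the method used in the State's reports. It treats the estimated difference as approximately normal, backs out a standard error from the reported two-sided p-value, and then rescales it to the desired confidence level. The difference of 16,119.5 ppb and p=0.003 are the zinc figures quoted above. The function name is our own.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def ci_from_p_value(difference, p_value, confidence):
    """Approximate CI for a difference, recovered from its two-sided p-value.

    Assumption (illustrative only): normal approximation, interval centered
    on the observed difference.
    """
    z_observed = STANDARD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    implied_standard_error = difference / z_observed
    z_wanted = STANDARD_NORMAL.inv_cdf(0.5 + confidence / 2.0)
    half_width = z_wanted * implied_standard_error
    return difference - half_width, difference + half_width

difference = 16119.5   # ppb, observed difference in medians quoted in the text
p_value = 0.003

# Confidence level 100% - p: the interval runs from about 0 to twice the difference.
ci_widest = ci_from_p_value(difference, p_value, confidence=1.0 - p_value)
# Conventional 95% level: same center, narrower interval.
ci_95 = ci_from_p_value(difference, p_value, confidence=0.95)

print([round(x) for x in ci_widest])
print([round(x) for x in ci_95])
```

Because the p-value in this example came from a rank-based test, the normal approximation should be read as a rough guide to the width of the interval rather than an exact result.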

The  CI  corresponding  to  an  observed  difference  does  give  some  idea 
how  much  too  low  or  too  high  the  estimated  difference  might  be. 
The  point  estimate  could  favor  either  the  State  or  ARCO  but  the  CI 
does  not  provide  any  information  about  who  was  favored  by  the  play 
of  chance.  The  point  estimate  cuts  through  chance  variation  and  it 
avoids  favoritism.  It  is  the  best  available  estimate.  It  can  also 
be  shown  using  statistical  theory  based  on  likelihood  methods,  that 
the  relative  likelihood  that  the  observed  data  was  generated  by  a 
true  value  different  from  the  point  estimate  diminishes 
progressively  the  further  you  go  either  up  or  down  from  the  point 
estimate.  The  bottom  line  is  that  the  point  estimate  is  a  fair 
approach  for  making  an  injury  assessment  that  balances  chance  to 
get  the  best  indication  of  truth.  From  this  point  of  view  the  CI 
and  the  SE  are  not  particularly  informative  and  basically  do  not 
add  much  useful  new  information  for  reaching  a  fair  decision. 



Furthermore  the  significance  test  starts  by  giving  special  status 
to  a  particular  working  assumption,  typically  called  the  null 
hypothesis,  that  there  is  no  difference  between  the  injured  and 
reference  areas.  It  then  proceeds  to  assemble  evidence  against 
this  null  hypothesis.  This  statistical  procedure  inherently  favors 
ARCO  by  placing  the  burden  of  proof  on  the  State  to  demonstrate 
that  there  is  a  difference  between  the  injured  and  reference  areas 
that  is  not  due  to  chance  alone.  The  p-value  implements  this  basic 
examination  of  the  findings  in  a  concise  way.  If  an  observed 
difference is statistically significant, that is, the p-value is
less than the alpha level, there is little to be gained from having a
CI.  In  fact,  the  NRDA  injury  assessment  has  typically  resulted  in 
impressive  p-values. 

In  an  NRDA,  there  is  an  essential  role  for  a  CI  in  the  situation 
when  the  observed  difference  between  an  injured  and  a  reference 
area  is  not  statistically  significant.  It  is  the  nature  of 
statistical  significance  tests  that  they  can  be  used  to  disprove 
the  null  hypothesis  but  not  to  prove  it.  It  is  well  understood  by 
experienced  statisticians  that  failure  to  reject  a  null  hypothesis 
because the p-value is not sufficiently low is not equivalent to
being  able  to  accept  the  null  hypothesis.  There  is  widespread 
misunderstanding  about  this  subtle  point  among  substantive 
investigators  and  unfortunately  among  some  statisticians  as  well. 
To  prove  a  null  hypothesis,  say  a  null  hypothesis  that  there  is  NO 
difference  between  the  injured  and  reference  area,  it  would  be 
necessary  to  construct  a  CI  and  find  that  all  of  the  plausible  true 
differences  that  are  contained  in  the  interval  are  of  low 
"importance".  This  approach  would  be  required  for  example  for  ARCO 
to  argue  that  there  is  no  injury  from  mining  operations. 

C.    Mann-Whitney  and  Ranked  Squares  Tests 

The  ASR  criticizes  the  Mann-Whitney  (MW)  significance  test  method 
prescribed  by  the  regulations  for  comparing  injured  vs.  reference 
groundwater  distributions  of  contaminant  levels.  The  ASR  argues 
that  the  MW  test  is  defective  because  it  can  detect  various  ways  in 
which  these  distributions  differ.  We  agree  that  the  MW,  the 
randomization tests, and the standard parametric tests can, in some
cases, give a significant result because the distribution of values
for the injured area is: (1) located higher than the reference
distribution, (2) more variable than the reference distribution,
or (3) both. This is exactly as it should be. When an area has
been injured by mining it would be surprising if the additional
mining-related influences were exactly the same at every point
in  the  injured  area.  In  addition  to  the  upward  shift  in  measured 
values  for  the  contaminant  of  interest  in  the  injured  area,  it  is 
typical  for  variability  of  the  measured  values  to  increase  as  well. 
It  is  very  common  in  statistical  practice  to  find  an  increase  in 
variability  along  with  an  upward  shift  in  the  location  of  a 
distribution. 




It  is  even  possible  that  a  mining  injury  could  produce  localized, 
high-concentration  hot-spots  that  will  be  found  in  less  than  half 
of  the  samples,  for  example,  from  less  than  half  of  the  test  wells 
in  an  injured  area.  In  this  case,  the  median  of  the  distribution 
for  the  sample  of  injured  wells  would  not  be  relocated  at  all  but 
the  area  would  still  be  substantially  injured.  This  kind  of 
spatially  heterogeneous  injury  would  however  show  up  as  increased 
variability.  It  is  entirely  appropriate  that  the  State  has  used 
statistical  tests  that  in  some  cases  are  also  capable  of  detecting 
spatially  heterogeneous  impact. 
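
The hot-spot situation can be illustrated with a small constructed example. The well concentrations below are hypothetical and are chosen only to show that the median can stay fixed while the variability increases sharply. They are not data from this NRDA.

```python
from statistics import median, pstdev

# Hypothetical concentrations (ppb) from ten reference wells.
reference_wells = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# Hypothetical injured area: hot spots in three of ten wells (fewer than half).
injured_wells = [10, 12, 14, 16, 18, 20, 22, 2400, 2600, 2800]

reference_median = median(reference_wells)
injured_median = median(injured_wells)
reference_spread = pstdev(reference_wells)
injured_spread = pstdev(injured_wells)

print(reference_median, injured_median)                    # the medians coincide
print(round(reference_spread, 1), round(injured_spread, 1))  # the spread does not
```

A comparison based only on medians would miss the difference between these two hypothetical areas, while a comparison that is sensitive to variability would not.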

In  addition  to  the  MW  test,  which  is  capable  of  detecting  either 
heterogeneous or homogeneous impact, the regulations also prescribe
the ranked squares test, which is focused more directly on the
variability issue per se. These two tests and the corresponding
descriptive measures, the medians and interquartile ranges,
provide  a  good  understanding  of  what  sort  of  differences  are 
present  in  the  injured  compared  to  the  reference  area. 

D.    Multiple  Comparisons 

The  ASR  also  raises  the  issue  that  multiple  comparison  methods  were 
not  consistently  used  in  situations  where  families  of  statistical 
significance  tests  were  applied.  In  particular  this  issue  was 
raised  in  the  deposition  of  R.  J.  Brand  relative  to  the  groundwater 
report  where  multiple  comparison  methods  were  not  used  in 
connection  with  the  family  of  comparisons  between  the  injured  and 
reference  areas.  The  members  of  the  family  in  this  case  correspond 
to  the  comparisons  of  different  contaminants  that  were  available 
from  test  wells  in  the  injured  and  reference  areas.  This  was  in 
contrast  to  some  instances  in  other  reports  where  multiple 
comparison  methods  were  used  for  some  families  of  related 
significance  tests. 

Before  discussing  this  issue  further  it  is  necessary  to  mention  a 
few  things  about  the  general  nature  of  multiple  comparison  methods. 
First, to use multiple comparison methods, one must specify
the  relevant  family  of  statistical  comparisons  to  which  the 
multiple  comparison  method  is  to  be  applied. 

Second,  the  multiple  comparison  method,  in  one  of  its  common  forms, 
is  an  extension  of  the  idea  of  a  statistical  significance  test. 
Instead  of  focusing  on  a  single  comparison  involving,  say  zinc,  it 
focuses  on  a  whole  collection  of  comparisons  simultaneously.  To 
understand  how  this  is  done  we  must  first  recall  some  of  the 
concepts  involved  with  any  single  test.  One  goal  in  a  statistical 
significance  test  is  to  protect  against  the  possibility  that  an 
observed  difference  may  happen  by  chance  alone  when  there  is  truly 
no  difference  between  the  distributions  of  contaminant  levels  in 
the  comparison  areas.  Again,  the  probability  that  this  type  of 
error  can  happen  by  chance  alone  is  called  the  alpha  level  of  the 
test.   If  the  alpha  level  is  set  in  advance  at  10%,  5%,  or  1%,  then 



this  specifies  the  risk  of  falsely  concluding  there  is  a  difference 
that  was  actually  due  to  chance. 

With  the  multiple  comparisons  method  for  significance  testing,  one 
considers  a  "global  alpha  level."  This  refers  to  the  probability 
that  one  or  more  of  the  family  of  statistical  comparisons  could 
give  a  statistically  significant  difference  by  chance  alone  when, 
in  fact,  each  of  the  contaminants  truly  has  the  same  distributions 
in  the  injured  area  as  it  does  in  the  reference  area.  The  global 
alpha  level  has  a  corresponding  p-value  which  can  be  called  the 
"global  p-value." 

There  are  three  common  ways  to  implement  the  multiple  comparison 
approach.  One  is  to  compute  the  global  p-value  and  report  it. 
Another  is  to  use  adjusted  individual  alpha  levels.  When  adjusted 
alpha  levels  are  used  for  the  individual  comparisons  and  one  or 
more  observed  differences  is  significant  relative  to  the  adjusted 
individual  alpha  level,  then  the  family  of  comparisons  is 
significant  at  the  specified  global  alpha  level.  A  third  way  is 
discussed  below. 
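
To see why a global alpha level is of interest, the following minimal sketch computes the chance of at least one falsely significant result in a family of comparisons. It assumes the comparisons are statistically independent, which is an illustrative assumption and not a claim made in this report; the function name and the 5% individual alpha level are likewise hypothetical.

```python
def global_alpha(individual_alpha: float, k: int) -> float:
    """Chance of one or more false positives among k independent tests,
    each run at the same individual alpha level (independence is assumed)."""
    return 1.0 - (1.0 - individual_alpha) ** k


# Illustrative only: unadjusted 5% tests for families of different sizes.
for k in (1, 5, 10, 12):
    print(k, round(global_alpha(0.05, k), 3))
```

Under this assumption the global alpha level grows with the size of the family, which is the motivation for adjusting the individual alpha levels.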

A  commonly  used  procedure  for  constructing  a  global  p-value  or  for 
getting  adjusted  individual  alpha  levels  is  to  use  the  "Bonferroni 
method"  mentioned  in  the  ASR.  The  Bonferroni  method  is  generally 
used  with  equal  adjustment  for  all  members  of  a  family  of  K 
comparisons.  Then  the  adjusted  individual  alpha  levels  are  simply 
the  global  alpha  level  divided  by  K.  Alternatively,  the  global 
p-value is K times the minimum individual p-value observed in the
family of comparisons. The Bonferroni method is the easiest
multiple  comparison  method  to  use  but  is  often  rather  conservative, 
tending  to  produce  less  significant  results  than  some  more 
sophisticated  but  more  difficult  to  implement  methods. 

The  Bonferroni  method  can  be  illustrated  with  a  numerical  example 
as  follows:  If  the  desired  global  alpha  level  is  10%  and  there  are 
10  comparisons  in  the  relevant  family  of  comparisons,  then  the 
adjusted  alpha  level  for  individual  comparisons  is  simply  1%  rather 
than  10%.  Alternatively,  if  the  minimum  observed  p-value  for  one 
of  10  comparisons  is  p=0.001  then  the  global  p-value  is  10  times 
0.001=0.01.  This  global  p-value  is  the  smallest  global  alpha  level 
at  which  you  could  conclude  that  one  or  more  of  the  observed 
differences  is  larger  than  you  could  expect  from  chance  alone. 
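
The arithmetic of the numerical example above can be sketched as follows. The function names are hypothetical, and capping the global p-value at 1 is a common convention that is assumed here; it is not stated in the text.

```python
def bonferroni_adjusted_alpha(global_alpha: float, k: int) -> float:
    """Individual alpha level for each of k comparisons (equal adjustment)."""
    return global_alpha / k


def bonferroni_global_p(min_p: float, k: int) -> float:
    """Global p-value: k times the smallest individual p-value.
    The cap at 1.0 is an assumed convention, since a probability cannot exceed 1."""
    return min(1.0, k * min_p)


# Values from the example in the text: global alpha 10%, K = 10, minimum p = 0.001.
print(bonferroni_adjusted_alpha(0.10, 10))
print(bonferroni_global_p(0.001, 10))
```

Both calls reproduce the 1% figures given in the example, up to floating-point rounding.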

Given  how  the  Bonferroni  multiple  comparison  method  works,  it  is 
now  easier  to  describe  the  third  way  of  dealing  with  the  multiple 
comparisons.  Assuming  the  report  indicates  how  many  comparisons 
are  in  the  relevant  family  or  it  is  clear  from  the  context,  simply 
report  the  individual  comparison  p-values  and  let  the  reader  make 
the  simple  adjustments  that  are  involved  in  interpreting  the 
results  from  a  multiple  comparisons  point  of  view.  Many 
statisticians  and  statistical  practitioners  follow  this  third 
approach. Its main limitation is that it assumes some understanding
of the multiple comparisons method on the part of the reader.

There  is  disagreement  among  statisticians  about  the  best  way  to 
handle  the  multiple  comparisons  issue.  This  controversy  is 
discussed  in  a  recently  published  book  on  multiple  comparison 
methods.3  Since  the  various  reports  prepared  by  the  State  have 
involved  different  statisticians  and  investigators,  it  is  not 
surprising  that  there  has  been  some  difference  of  style  as  to 
whether  or  not  the  multiple  comparisons  adjustments  have  been 
explicitly  included  in  the  reports. 

To  put  the  multiple  comparison  issue  in  perspective,  it  is  useful 
to  consider  the  Butte  groundwater  report.  Consider  the  family  of 
comparisons  between  the  Butte  Hill  injured  bedrock  data  and  the 
corresponding data for the reference area (Table 3-3). Suppose
that  K=12  comparisons  in  the  data  set  are  ultimately  judged  to  be 
relevant for injury assessment. The minimum individual p-value,
which is 0.0001 for SO4, would be multiplied by 12, which gives
0.0012 for the global p-value. Chance can still be rather
convincingly  eliminated  as  a  plausible  explanation  for  the  observed 
differences.  Alternatively,  if  the  analyst  wants  to  use  a  global 
alpha  level  equal  to  0.10,  a  fairly  typical  value  in  a  multiple 
comparisons  application,  then  the  adjusted  individual  alpha  levels 
would be 0.10/12 = 0.0083. Seven of the 12 contaminants have p-values
lower  than  this  but  only  one  would  be  needed  to  achieve 
significance  from  a  multiple  comparisons  point  of  view. 


VI.   ALTERNATE  STATISTICAL  METHODS 

There  are  numerous  component  parts  to  the  State's  NRDA.  The  ASR 
(S11) expresses the concern that the analysis methods used in some
portions  of  the  NRDA  were  not  the  analysis  methods  that  were  the 
most  appropriate  for  the  sampling  design  used  to  collect  the  data. 
It  is  hard  to  respond  to  this  comment  because  it  lacks  specificity. 
The  comment  appears  to  be  part  of  the  ASR's  general  effort  to  cast 
doubt on the State's methodology and to establish the impression
that  some  other,  and  presumably  more  appropriate,  methods  would 
have  been  fairer. 

The  ASR  did  comment  that  some  paired  samples  were  analyzed  without 
considering  the  pairing.  Typically,  an  analysis  of  paired  data 
which  does  not  exploit  pairing  would  not  alter  descriptive 
differences  between  the  injured  and  reference  areas.  However,  as 
is  well  known,  when  pairing  has  not  been  exploited  in  an  analysis 
of  paired  data,  one  typically  gets  p-values  that  are  conservative 
and less significant because they are larger than need be. Thus,
in contrast to the impression left by the ASR, non-use of paired
methods in analysis of paired data would favor ARCO rather than the
State.

3  "Resampling-Based Multiple Testing," by P. H. Westfall and S. S. Young, J. Wiley & Sons, 1993, section 1.5.
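
The point about pairing can be illustrated with a minimal sketch. The data are simulated under an assumed model in which each pair shares a common site effect (so the paired values are positively correlated); the sample size, effect sizes, and function name are hypothetical and are not taken from the NRDA data.

```python
import math
import random
import statistics


def paired_and_unpaired_se(x, y):
    """Return the mean difference and its standard error computed two ways:
    exploiting the pairing, and ignoring it (two independent samples)."""
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    se_paired = statistics.stdev(diffs) / math.sqrt(n)
    se_unpaired = math.sqrt(statistics.variance(x) / n + statistics.variance(y) / n)
    return mean_diff, se_paired, se_unpaired


# Simulated, hypothetical data: 30 pairs sharing a site effect.
rng = random.Random(0)
shared = [rng.gauss(0.0, 3.0) for _ in range(30)]
injured = [s + 2.0 + rng.gauss(0.0, 1.0) for s in shared]
reference = [s + rng.gauss(0.0, 1.0) for s in shared]

mean_diff, se_paired, se_unpaired = paired_and_unpaired_se(injured, reference)
print(mean_diff, se_paired, se_unpaired)
```

The estimated difference is the same either way; only the standard error changes. When the paired values are positively correlated, ignoring the pairing gives the larger standard error and hence the larger p-value.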

A.    Different Pooling-Weighting of Repeat Measurements (S7; G7;
A4,10)   [Brand]

The  ASR  argues  that  the  State  should  have  further  diagnosed  the 
component  sources  of  variability  within  injured  and  reference 
areas.  In  particular,  the  ASR  expressed  concern  about  the 
variability  between  repeat  measurements  at  particular  well  sites, 
river  reaches,  etc.,  in  addition  to  the  "pure"  variability  between 
sites,  reaches,  etc. 

The  ASR  raised  another  issue  connected  with  repeat  measurements  at 
the  sites  and  the  variability  of  these  repeat  measurements.  It 
noted  that  when  repeat  data  at  each  site  were  boiled  down  to  a 
single  mean  or  median,  there  were  often  varying  numbers  of  site- 
specific  repeat  numbers  that  went  into  calculation  of  the  single 
value  for  the  site.  The  ASR  argues  that  when  site-specific  data 
were  then  used  to  get  overall  means  or  medians  to  measure  impacts 
in  the  injured  vs.  reference  areas,  the  State  should  have  used  a 
more  elaborate  analysis  which  weighted  site-specific  data  by  the 
number  of  repeat  measurements  available  for  the  respective  sites. 
Also,  the  ASR  gives  the  impression  that  such  a  more  elaborate 
analysis  and  statistical  work-up  of  variability  will  change  the 
results  substantially  in  ARCO's  favor. 

When  the  State  compared  injured  vs.  control  areas  it  implicitly 
took  into  consideration  both  between-site  and  pure  within-site 
variability.  Typically,  neither  of  these  can  be  observed  directly. 
Instead  they  combine  to  produce  the  observed  total  variability  in 
the  summary  numbers  constructed  to  represent  the  sites.  It  is  this 
overall  total  variability  that  shows  up  in  data  plots  of  site  to 
site  values  by  area.  It  is  also  this  overall  total  variability 
that  comes  into  play  in  the  determination  of  p-values  for 
comparison  of  injured  vs.  reference  areas.  The  State's  procedures 
gave  full  attention  to  this  overall  total  variability. 

With  respect  to  weighted  analysis,  it  is  well  known  that  both 
weighted  and  unweighted  analysis  methods  are  unbiased  in  the  sense 
that  they  give  the  true  answer  on  the  average.  Thus  using 
unweighted  analysis,  which  could  be  implemented  more  cost 
effectively,  was  in  no  way  systematically  unfair  to  ARCO.  On 
average  it  gives  the  same  descriptive  findings  as  the  weighted 
analysis. 
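
The unbiasedness statement can be checked with a small simulation. This is a sketch under an assumed model in which every site in an area has the same long-run mean, with random site-to-site and repeat-to-repeat variation; the standard deviations, repeat counts, and function name are hypothetical.

```python
import random
import statistics


def simulate_area(rng, true_mean, repeats_per_site, site_sd=2.0, repeat_sd=1.0):
    """Simulate one area and return (unweighted, weighted) means of site means.
    The weighted version weights each site by its number of repeat measurements."""
    site_means = []
    for n_rep in repeats_per_site:
        site_level = true_mean + rng.gauss(0.0, site_sd)
        obs = [site_level + rng.gauss(0.0, repeat_sd) for _ in range(n_rep)]
        site_means.append(statistics.mean(obs))
    unweighted = statistics.mean(site_means)
    total = sum(repeats_per_site)
    weighted = sum(n * m for n, m in zip(repeats_per_site, site_means)) / total
    return unweighted, weighted


rng = random.Random(1)
true_mean = 10.0
repeats = [1, 2, 3, 5, 8, 10]  # unequal numbers of repeat measurements per site
results = [simulate_area(rng, true_mean, repeats) for _ in range(5000)]
avg_unweighted = statistics.mean([u for u, _ in results])
avg_weighted = statistics.mean([w for _, w in results])
print(avg_unweighted, avg_weighted)
```

Averaged over many simulated data sets, both estimates settle near the assumed true mean, which is the sense of "unbiased" used above.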

It  is  also  known  that  the  unweighted  analysis  tends  to  be  more 
conservative  than  the  weighted  analysis  in  the  sense  that  it  tends 
to  produce  p-values  that  are  larger  and  therefore  less  significant. 
Therefore,  for  the  assessment  of  statistical  significance,  the 
unweighted  analysis  most  likely  favored  ARCO. 

The  unweighted  approach  was  necessary  in  order  to  implement  the 
Mann-Whitney  and  ranked  squares  tests  prescribed  in  the  groundwater 
section  of  the  regulations  since  these  tests  do  not  have  provisions 
for  doing  weighted  analysis.  The  bottom  line  is  that  the  analysis 
methods  used  by  the  State  were  fair  to  ARCO  and  alternative 
weighted  analysis  would  be  unlikely  to  make  any  meaningful 
difference  in  the  outcome  of  the  injury  assessment. 

B.    Adjustment for Covariables. (S2,3,13; O2,4b,6; G4,6,9b;
A2,6,12b; T5b,6)

The  State's  method  for  injury  quantification  was  to  obtain  fair, 
appropriate  reference  data  for  comparison  to  the  data  from  the 
injured  area.  It  is  conceivable  that  there  is  some  within-area, 
site-to-site  variation  in  some  covariables.  It  is  also  conceivable 
that  the  mean  levels  of  some  covariables  differ  somewhat  between 
the  injured  and  reference  areas.  In  principle,  covariables  which 
measure  possible  site  to  site  differences  could  be  measured  and 
used  in  a  more  elaborate  statistical  analysis  to  fine-tune  an 
injured  vs.  reference  comparison.  An  adjustment  of  this  type  would 
require  use  of  an  assumed  statistical  model  and  the  particular 
model  chosen  would  have  an  impact  on  the  outcome  of  such  an 
analysis. 

The  ASR  suggests  some  possible  models  for  covariable  analysis  and 
adjustment  which,  in  effect,  represents  an  opinion  about  the  level 
of  statistical  technology  that  should  be  used  in  the  NRDA.  To  use 
the  suggested  model,  for  many  of  the  outcome  variables,  it  would  be 
necessary  to  use  variable  transformations  to  make  the  outcome 
variable  more  symmetrically  distributed.  It  should  be  noted  that 
more  elaborate  models  and  methods  than  those  suggested  by  the  ASR 
are  also  possible. 

There  are  several  important  points  that  should  be  noted  about 
adjustment  for  covariates.  The  first  is  that  the  adjustment  could 
systematically  alter  the  injury  quantification  to  either  decrease 
or  increase  it.  There  is,  in  general,  no  reason  to  expect  that 
adjustment  would  lead  to  a  new  result  that  would  be  more  favorable 
to  ARCO.  Thus,  in  this  sense,  non-adjusted  analysis  is  fair  to 
both  sides. 

The  second  point  about  covariable  adjustment  is  that,  whether  it 
leads  to  an  upward  or  downward  systematic  adjustment,  it  tends  to 
reduce  the  amount  of  variability  that  must  be  dealt  with  in  making 
the  broad  comparison  between  means  or  medians  in  the  injured  and 
reference  areas.  This  aspect  of  the  adjustment  methodology  would 
tend  to  increase  the  precision  of  the  injured  vs.  reference 
comparison and lead to more significant differences (i.e., lower
p-values). In this way, not using adjustment for covariables has
been  favorable  to  ARCO. 

It is commonly understood that for covariance analysis to lead to
an actual adjustment of the injured vs. reference difference, the
covariable  must  be  associated  with  the  outcome  variable  and  the 
mean  value  of  the  covariable  must  differ  between  the  injured  and 
reference  areas.  If  only  one  of  these  two  conditions  exists  then 
adjustment  via  covariance  analysis  will  not  alter  the  estimated 
difference  between  the  injured  and  reference  areas. 

If  the  covariable  is  associated  with  the  outcome  but  the  mean 
values  of  the  covariable  in  the  injured  and  reference  areas  are 
about  the  same,  then  the  estimated  difference  between  areas  will 
not  be  altered.  However,  the  precision  of  the  estimator  of  the 
differences  will  be  increased.  This  will  tend  to  make  the 
difference  more  statistically  significant.  In  this  case,  not  doing 
covariable  analysis  would  favor  ARCO. 

If  the  covariable  is  not  associated  with  the  outcome  variable  then 
the  estimated  difference  between  the  injured  and  reference  areas 
would  not  be  altered  even  if  the  mean  of  the  covariable  is  not  the 
same  in  both  the  injured  and  reference  areas.  In  this  case  doing 
or  not  doing  covariable  analysis  would  give  the  same  result. 

For  the  parallel-line  covariance  model  mentioned  in  the  ASR,  the 
adjustment  effect  would  be  the  product  of  two  factors.  The  first 
would  be  the  slope  of  the  lines  relating  the  covariable  to  the 
outcome  variable.  The  second  would  be  the  difference  between  the 
means  of  the  covariable  in  the  injured  and  reference  areas.  It  is 
quite  common  for  the  product  of  these  two  factors  to  be  rather 
trivial.  From  the  mathematical  results  it  can  be  deduced  that  the 
associations  (1)  between  the  covariable  and  area  or  (2)  between  the 
covariable  and  the  outcome  must  be  relatively  stronger  to  cause  an 
adjustment  that  is  "important"  relative  to  a  large  observed 
difference.  This  is  the  basis  for  the  generally  and  correctly  held 
view  that  it  is  less  likely  for  a  covariable  adjustment  to  have  an 
"important"  impact  on  a  large  observed  difference.  The  ASR 
incorrectly  disputes  this  view. 
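
The two-factor structure of the adjustment can be sketched as follows for the parallel-line covariance model. The function names and all numerical values are hypothetical and are chosen only to show how the product behaves.

```python
def covariance_adjustment(slope, cov_mean_injured, cov_mean_reference):
    """Adjustment effect: common slope times the difference in covariable means."""
    return slope * (cov_mean_injured - cov_mean_reference)


def adjusted_difference(raw_difference, slope, cov_mean_injured, cov_mean_reference):
    """Injured vs. reference difference after removing the adjustment effect."""
    return raw_difference - covariance_adjustment(
        slope, cov_mean_injured, cov_mean_reference
    )


raw = 50.0  # hypothetical observed difference in the outcome variable

# Covariable not associated with the outcome (slope 0): no change.
print(adjusted_difference(raw, 0.0, 12.0, 10.0))
# Covariable means equal in both areas: no change.
print(adjusted_difference(raw, 0.5, 10.0, 10.0))
# Both conditions present: the change is the product of the two factors.
print(adjusted_difference(raw, 0.5, 12.0, 10.0))
```

In the third call the adjustment is 0.5 times 2.0, so a raw difference of 50 moves by only 1 unit. A large observed difference is altered materially only when the product of the two factors is itself large.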

Generally,  large  injuries  are  found  in  the  NRDA.  Thus  any  changes 
resulting  from  covariance  adjustment  are  not  likely  to  have  much 
effect  on  the  conclusions. 

C.    Spatial Analysis.   (S10; O5acd,8; G1,5a; T8)

The  ASR  recommends  use  of  spatial  analysis  in  some  of  the  injury 
quantifications  for  two  main  purposes.  The  first  is  to  provide 
different  weighting  of  the  data  from  the  wells  within  an  injured  or 
reference  area.  The  second  is  to  obtain  standard  errors  to 
supplement  point  estimates. 

In  fact,  there  are  many  approaches  to  spatial  analysis  with 
different  models  and  estimation  methods.  All  require  some 
assumption  about  the  modeling  of  broad  spatial  trends  in  the  area 
(the fixed effects part of the spatial model) and all require some
assumption about the spatial autocorrelation structure. Would
spatial  autocorrelation  be  assumed  the  same  in  all  directions? 
Would  a  parametric  autocorrelation  model  or  a  non-parametric  model 
be  used?  Would  a  measurement  error  component  be  included  in  the 
autocorrelation  model  (nugget  effect)  or  not?  Would  a  trend  model 
be  included  or  would  spatial  variability  be  estimated  entirely  via 
the  covariance  structure  (random  effects)  part  of  the  spatial 
analysis  model?  What  transformations  of  the  outcome  variable  would 
be  used  to  give  outcome  distributions  that  are  more  consistent  with 
existing  spatial  analysis  technology?  It  certainly  would  not  be 
reasonable  to  reanalyze  the  data  using  a  variety  of  models  and  then 
just  present  the  one  that  gives  results  that  are  most  favorable. 

Rather than use elaborate spatial-model-based analyses, which
require disputable modeling assumptions, the State has chosen to
use  simpler  cost-effective  analysis  methods  with  emphasis  on 
producing  point  estimates. 

As  previously  discussed,  we  consider  the  point  estimates  to  be  the 
primary  information  that  is  needed  for  situations  like  an  NRDA 
where  the  goal  is  to  use  cost-effective  methods  leading  to  a 
balanced  decision  that  is  fair  to  both  sides.  Standard  errors 
merely  indicate  it  is  possible  that  the  truth  could  be  lower  or 
higher  than  the  point  estimate.  We  do  not  consider  standard  errors 
or  related  confidence  intervals  crucial  for  reaching  an  unbiased 
conclusion  when  significant  p-values  have  been  obtained. 

It  also  is  unlikely  that  different  weighting  of  individual  sites, 
which  might  result  from  a  spatial  analysis,  would  make  much 
difference  in  the  point  estimates.  In  situations  where  spatial 
analysis  could  be  used,  the  distributions  of  data  in  injured  areas 
are  typically  substantially  shifted  compared  to  the  distributions 
in  corresponding  reference  areas.  In  many  instances  there  is 
virtually  no  overlap.  Any  reweighted  means  or  medians  would  still 
be  close  to  the  centers  of  these  distributions  with  very  little 
change  in  the  substance  of  the  results.  Moreover,  it  is  just  as 
likely  that  results  from  spatial  analysis  would  show  increased 
injury  as  diminished  injury.  The  simpler  analysis  methods  used  by 
the  State  provide  strong  indication  that  the  resources  involved  in 
this  NRDA  are  substantially  injured  with  larger  differences  than 
can  be  explained  by  chance  alone. 


[Figure: Attachment 1]